39 research outputs found
Generalized Max Pooling
State-of-the-art patch-based image representations involve a pooling
operation that aggregates statistics computed from local descriptors. Standard
pooling operations include sum- and max-pooling. Sum-pooling lacks
discriminability because the resulting representation is strongly influenced by
frequent yet often uninformative descriptors, but only weakly influenced by
rare yet potentially highly-informative ones. Max-pooling equalizes the
influence of frequent and rare descriptors but is only applicable to
representations that rely on count statistics, such as the bag-of-visual-words
(BOV) and its soft- and sparse-coding extensions. We propose a novel pooling
mechanism that achieves the same effect as max-pooling but is applicable beyond
the BOV and especially to the state-of-the-art Fisher Vector -- hence the name
Generalized Max Pooling (GMP). It involves equalizing the similarity between
each patch and the pooled representation, which is shown to be equivalent to
re-weighting the per-patch statistics. We show on five public image
classification benchmarks that the proposed GMP can lead to significant
performance gains with respect to heuristic alternatives.Comment: (to appear) CVPR 2014 - IEEE Conference on Computer Vision & Pattern
Recognition (2014
Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
We address the task of weakly-supervised few-shot image classification and
segmentation, by leveraging a Vision Transformer (ViT) pretrained with
self-supervision. Our proposed method takes token representations from the
self-supervised ViT and leverages their correlations, via self-attention, to
produce classification and segmentation predictions through separate task
heads. Our model is able to effectively learn to perform classification and
segmentation in the absence of pixel-level labels during training, using only
image-level labels. To do this it uses attention maps, created from tokens
generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We
also explore a practical setup with ``mixed" supervision, where a small number
of training images contains ground-truth pixel-level labels and the remaining
images have only image-level labels. For this mixed setup, we propose to
improve the pseudo-labels using a pseudo-label enhancer that was trained using
the available ground-truth pixel-level labels. Experiments on Pascal-5i and
COCO-20i demonstrate significant performance gains in a variety of supervision
settings, and in particular when little-to-no pixel-level labels are available.Comment: Accepted at CVPR 202
Low-level spatiochromatic grouping for saliency estimation
We propose a saliency model termed SIM (saliency by induction mechanisms), which is based on a low-level spatiochromatic model that has successfully predicted chromatic induction phenomena. In so doing, we hypothesize that the low-level visual mechanisms that enhance or suppress image detail are also responsible for making some image regions more salient. Moreover, SIM adds geometrical grouplets to enhance complex low-level features such as corners, and suppress relatively simpler features such as edges. Since our model has been fitted on psychophysical chromatic induction data, it is largely nonparametric. SIM outperforms state-of-the-art methods in predicting eye fixations on two datasets and using two metrics
Saliency estimation using a non-parametric low-level vision model
Many successful models for predicting attention in a scene involve three main steps: convolution with a set of filters, a center-surround mechanism and spatial pooling to construct a saliency map. However, integrating spatial information and justifying the choice of various parameter values remain open problems. In this paper we show that an efficient model of color appearance in human vision, which contains a principled selection of parameters as well as an innate spatial pooling mechanism, can be generalized to obtain a saliency model that outperforms state-of-the-art models. Scale integration is achieved by an inverse wavelet transform over the set of scale-weighted center-surround responses. The scale-weighting function (termed ECSF) has been optimized to better replicate psychophysical data on color appearance, and the appropriate sizes of the center-surround inhibition windows have been determined by training a Gaussian Mixture Model on eye-fixation data, thus avoiding ad-hoc parameter selection. Additionally, we conclude that the extension of a color appearance model to saliency estimation adds to the evidence for a common low-level visual front-end for different visual tasks
Revisiting the Fisher vector for fine-grained classification
International audienceThis paper describes the joint submission of Inria and Xerox to their joint participation to the FGCOMP'2013 challenge. Although the proposed system follows most of the standard Fisher classification pipeline, we describe a few key features and good practices that significantly improve the accuracy when specifically considering fine-grain classification tasks. In particular, we consider the late fusion of two systems both based on Fisher vectors, but for which we choose drastically design choices that make them very complementary. Moreover, we propose a simple yet effective filtering strategy, which significantly boosts the performance for several class domains